13 research outputs found

    Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge

    Full text link
    In this paper, we describe the systems developed by the SJTU X-LANCE team for the LIMMITS 2023 Challenge, focusing mainly on the system that won on naturalness in track 1. The aim of this challenge is to build a multi-speaker multi-lingual text-to-speech (TTS) system for Marathi, Hindi and Telugu. Each of the languages has a male and a female speaker in the given dataset. In track 1, only 5 hours of data from each speaker may be selected to train the TTS model. Our system is based on the recently proposed VQTTS, which utilizes VQ acoustic features rather than the mel-spectrogram. We introduce additional speaker embeddings and language embeddings into VQTTS to control the speaker and language information. In the cross-lingual evaluations, where we need to synthesize speech in the voice of a speaker from another language, we provide a native speaker's embedding to the acoustic model and the target speaker's embedding to the vocoder. In the subjective MOS listening test on naturalness, our system achieves 4.77, which ranks first. (Comment: Accepted by ICASSP 2023 Special Session for Grand Challenge)
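
    The cross-lingual routing described above (native speaker embedding to the acoustic model, target speaker embedding to the vocoder) can be pictured with the toy sketch below. All function names, embedding tables, and shapes are illustrative assumptions, not the authors' code; the real txt2vec/vec2wav components are neural networks, not stubs.

```python
# Hypothetical sketch of the cross-lingual inference routing described above.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 64

# Toy lookup tables for speaker / language embeddings (names are assumptions).
speaker_emb = {s: rng.normal(size=EMB_DIM) for s in
               ["mr_female", "mr_male", "hi_female", "hi_male", "te_female", "te_male"]}
language_emb = {l: rng.normal(size=EMB_DIM) for l in ["marathi", "hindi", "telugu"]}
native_speaker_of = {"marathi": "mr_female", "hindi": "hi_female", "telugu": "te_female"}

def acoustic_model(text, spk_emb, lang_emb):
    """Stub txt2vec-style AM: returns dummy VQ token indices for the input text."""
    return rng.integers(0, 512, size=len(text))

def vocoder(vq_tokens, spk_emb):
    """Stub vec2wav-style vocoder: returns a dummy waveform."""
    return rng.normal(size=vq_tokens.size * 256)

def synthesize_cross_lingual(text, target_speaker, language):
    # The AM is conditioned on a native speaker of the target language ...
    am_spk = speaker_emb[native_speaker_of[language]]
    vq = acoustic_model(text, am_spk, language_emb[language])
    # ... while the vocoder is conditioned on the target speaker's timbre.
    return vocoder(vq, speaker_emb[target_speaker])

wav = synthesize_cross_lingual("namaskar", target_speaker="te_male", language="marathi")
print(wav.shape)
```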

    EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

    Full text link
    Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity-controllable emotional TTS remains a challenging task. Most existing methods require external optimization for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model in which emotion intensity can be manipulated by a proposed soft-label guidance technique derived from classifier guidance. Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label where the values of the specified emotion and Neutral are set to α and 1−α, respectively. The α here represents the emotion intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can precisely control the emotion intensity while maintaining high voice quality. Moreover, diverse speech with a specified emotion intensity can be generated by sampling in the reverse denoising process. (Comment: Accepted to ICASSP 2023)
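
    The core of the soft-label guidance idea can be sketched as the gradient of a weighted log-probability, as in the toy PyTorch snippet below. The classifier, shapes, and guidance scale are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of soft-label classifier guidance: weight alpha on the target
# emotion and 1 - alpha on "Neutral" instead of a one-hot target.
import torch

def soft_label_guidance(x_t, classifier, target_id, neutral_id, alpha, scale=1.0):
    """Return a guidance term to add to the reverse-diffusion update of x_t."""
    x = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x), dim=-1)      # (batch, num_emotions)
    # Soft-label objective: alpha * log p(target | x) + (1 - alpha) * log p(neutral | x)
    obj = (alpha * log_probs[:, target_id] + (1.0 - alpha) * log_probs[:, neutral_id]).sum()
    grad, = torch.autograd.grad(obj, x)
    return scale * grad

# Toy usage: a linear "emotion classifier" over a single feature frame.
torch.manual_seed(0)
num_emotions, feat_dim = 5, 80
classifier = torch.nn.Linear(feat_dim, num_emotions)
x_t = torch.randn(2, feat_dim)            # stand-in for a noisy sample at step t
g = soft_label_guidance(x_t, classifier, target_id=3, neutral_id=0, alpha=0.7)
x_t = x_t + g                             # guidance folded into the reverse step
print(g.shape)
```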

    VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

    Full text link
    The mainstream neural text-to-speech (TTS) pipeline is a cascade system, including an acoustic model (AM) that predicts acoustic features from the input transcript and a vocoder that generates the waveform from the given acoustic features. However, the acoustic feature in current TTS systems is typically the mel-spectrogram, which is highly correlated along both the time and frequency axes in a complicated way, making it difficult for the AM to predict. Although recent neural vocoders can generate high-fidelity audio from the ground-truth (GT) mel-spectrogram, the gap between the GT and the mel-spectrogram predicted by the AM degrades the performance of the entire TTS system. In this work, we propose VQTTS, consisting of an AM txt2vec and a vocoder vec2wav, which uses self-supervised vector-quantized (VQ) acoustic features rather than the mel-spectrogram. We redesign both the AM and the vocoder accordingly. In particular, txt2vec essentially becomes a classification model instead of a traditional regression model, while vec2wav uses an additional feature encoder before the HifiGAN generator to smooth the discontinuous quantized features. Our experiments show that vec2wav achieves better reconstruction performance than HifiGAN when using self-supervised VQ acoustic features. Moreover, our entire TTS system VQTTS achieves state-of-the-art naturalness among all currently publicly available TTS systems. (Comment: This version has been removed by arXiv administrators because the submitter did not have the authority to assign the license at the time of submission)
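
    The shift from regression to classification that this abstract describes can be illustrated with a toy loss comparison; the batch size, frame count, codebook size, and mel dimension below are assumptions, not values from the paper.

```python
# Illustration only: a mel-based AM regresses continuous frames, whereas a
# VQTTS-style txt2vec classifies each frame into one of K codebook indices.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, K, MEL = 2, 100, 512, 80

# Regression-style AM: continuous mel-spectrogram targets.
mel_pred, mel_target = torch.randn(B, T, MEL), torch.randn(B, T, MEL)
regression_loss = F.l1_loss(mel_pred, mel_target)

# Classification-style AM: self-supervised VQ token indices as targets.
token_logits = torch.randn(B, T, K)            # txt2vec output logits per frame
token_target = torch.randint(0, K, (B, T))     # ground-truth VQ indices
classification_loss = F.cross_entropy(token_logits.reshape(-1, K),
                                      token_target.reshape(-1))

print(regression_loss.item(), classification_loss.item())
```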

    VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

    Full text link
    Although diffusion models have become a popular choice for text-to-speech due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the generation of mel-spectrograms as an ordinary differential equation conditioned on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens the sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single-speaker and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to its diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique in VoiceFlow. (Comment: 4 figures, 5 pages, submitted to ICASSP 2024)
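
    A minimal, self-contained sketch of flow matching with straight paths and few-step Euler sampling is shown below. The toy vector-field network and dimensions are assumptions; text conditioning and the rectification (re-training on generated pairs) that give rectified flow its name are omitted here.

```python
# Toy flow matching: regress the constant velocity of a straight noise-to-data
# path, then integrate the learned ODE with a few Euler steps.
import torch

torch.manual_seed(0)
DIM = 80
v_net = torch.nn.Sequential(torch.nn.Linear(DIM + 1, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, DIM))
opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)

def training_step(x1):
    """One flow-matching step on a batch of 'mel' targets x1."""
    x0 = torch.randn_like(x1)                    # noise endpoint
    t = torch.rand(x1.size(0), 1)                # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                   # point on the straight path
    target = x1 - x0                             # velocity of that path
    pred = v_net(torch.cat([xt, t], dim=-1))
    loss = torch.mean((pred - target) ** 2)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def sample(n, steps=8):
    """Few-step Euler integration of the learned ODE from noise to 'mel'."""
    x = torch.randn(n, DIM)
    for i in range(steps):
        t = torch.full((n, 1), i / steps)
        x = x + v_net(torch.cat([x, t], dim=-1)) / steps
    return x

for _ in range(10):
    training_step(torch.randn(16, DIM))          # random stand-in for mel targets
print(sample(2).shape)
```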

    Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

    Full text link
    The proficiency of self-supervised learning (SSL) in speech-related tasks has driven research into utilizing discrete tokens for speech tasks such as recognition and translation, which offer lower storage requirements and great potential for applying natural language processing techniques. However, these studies have mainly been single-task focused and have faced challenges such as overfitting and performance degradation in speech recognition, often at the cost of sacrificing performance in multi-task scenarios. This study presents a comprehensive comparison and optimization of the discrete tokens generated by various leading SSL models on speech recognition and synthesis tasks. We aim to explore the universality of speech discrete tokens across multiple speech tasks. Experimental results demonstrate that discrete tokens achieve results comparable to systems trained on FBank features in speech recognition and outperform mel-spectrogram features in speech synthesis in both subjective and objective metrics. These findings suggest that universal discrete tokens have enormous potential for various speech-related tasks. Our work is open-source and publicly available to facilitate research in this direction.
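
    One common recipe for obtaining such discrete tokens is k-means clustering of frame-level SSL features; the snippet below is a generic illustration of that idea with random stand-in features and an assumed codebook size, not necessarily the exact pipeline used in the paper.

```python
# Cluster frame-level SSL features with k-means and use cluster IDs as tokens.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
T, D, K = 500, 768, 100                   # frames, SSL feature dim, codebook size
ssl_features = rng.normal(size=(T, D))    # stand-in for e.g. HuBERT / wav2vec 2.0 frames

kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(ssl_features)
tokens = kmeans.predict(ssl_features)     # discrete token sequence, one ID per frame
print(tokens[:20])
```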

    UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

    Full text link
    The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to the traditional mel-spectrogram acoustic feature in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models are restricted to generating speech only in a left-to-right direction, making them unsuitable for speech editing, where both preceding and following contexts are provided. Furthermore, these models rely on acoustic tokens, whose audio quality is limited by the performance of the audio codec models. In this study, we propose a unified context-aware TTS framework called UniCATS, which is capable of both speech continuation and editing. UniCATS comprises two components: an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav. CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the input text, enabling it to incorporate the semantic context and maintain seamless concatenation with the surrounding context. Following that, CTX-vec2wav utilizes contextual vocoding to convert these semantic tokens into waveforms, taking the acoustic context into consideration. Our experimental results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms of speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing.
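
    The editing setting described above can be pictured as in-filling between two token contexts. The stub sketch below only illustrates the data flow between the two components; all names, shapes, and return values are placeholders rather than the UniCATS implementation.

```python
# Toy data-flow sketch: semantic tokens for the edited region are generated
# conditioned on both surrounding contexts, then vocoded with acoustic context.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 500

def ctx_txt2vec(text, left_tokens, right_tokens):
    """Stub for contextual VQ-diffusion: predict semantic tokens for the new text,
    conditioned on semantic context from both sides."""
    return rng.integers(0, VOCAB, size=len(text) * 4)

def ctx_vec2wav(tokens, left_acoustic_ctx, right_acoustic_ctx):
    """Stub for contextual vocoding: turn semantic tokens into a waveform,
    using surrounding acoustic context for speaker/recording consistency."""
    return rng.normal(size=tokens.size * 256)

left_tokens  = rng.integers(0, VOCAB, size=120)   # tokens of the preceding speech
right_tokens = rng.integers(0, VOCAB, size=80)    # tokens of the following speech
new_tokens = ctx_txt2vec("edited words", left_tokens, right_tokens)
wav = ctx_vec2wav(new_tokens, left_acoustic_ctx=None, right_acoustic_ctx=None)
print(new_tokens.shape, wav.shape)
```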

    Acoustic Word Embeddings for End-to-End Speech Synthesis

    No full text
    The most recent end-to-end speech synthesis systems use phonemes as input tokens and ignore the information about which word the phonemes come from. However, many words have their own specific prosody, which may significantly affect naturalness. Prior works have employed pre-trained linguistic word embeddings as TTS system input. However, since linguistic information is not directly relevant to how words are pronounced, the TTS quality improvement of these systems is modest. In this paper, we propose a novel and effective way of jointly training acoustic phone and word embeddings for end-to-end TTS systems. Experiments on the LJSpeech dataset show that the acoustic word embeddings dramatically decrease both the training and validation loss in phone-level prosody prediction. Subjective evaluations of naturalness demonstrate that the incorporation of acoustic word embeddings significantly outperforms both the pure phone-based system and the TTS system with pre-trained linguistic word embeddings.
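
    The joint phone-and-word input described above can be pictured as two embedding tables whose outputs are combined per phone. The sketch below uses assumed vocabulary sizes, dimensions, and a simple additive combination, which may differ from the paper's exact scheme.

```python
# Each phone embedding is combined with the embedding of the word it belongs to;
# both tables would be trained jointly with the rest of the TTS model.
import torch

torch.manual_seed(0)
NUM_PHONES, NUM_WORDS, DIM = 70, 1000, 256
phone_table = torch.nn.Embedding(NUM_PHONES, DIM)
word_table = torch.nn.Embedding(NUM_WORDS, DIM)     # acoustic word embeddings

phone_ids = torch.tensor([[12, 5, 33, 8, 41]])      # phones of a short utterance
word_ids  = torch.tensor([[ 7, 7,  7, 2,  2]])      # word index each phone belongs to

encoder_input = phone_table(phone_ids) + word_table(word_ids)  # joint representation
print(encoder_input.shape)                                      # (1, 5, 256)
```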

    Research on Seismic Performance and Reinforcement Methods for Self-Centering Rocking Steel Bridge Piers

    No full text
    To study the seismic performance of self-centering circular-section rocking steel bridge piers whose functions can be restored after an earthquake, a high-precision finite element (FE) analysis model of such bridge piers was established. The hysteresis behavior of concrete-infilled and hollow rocking steel bridge piers was compared. In response to the local deformation of the wall plates and the elliptical deformation of the bottom surface, two reinforcement methods for the pier bottom were proposed: thickening the wall plate and adding longitudinal stiffeners in the plastic zone of the pier bottom. Pseudo-static analysis of the bridge piers was carried out considering the effects of the overall design parameters and the reinforcement parameters of the pier bottom. The results indicate that the FE model used in this paper can obtain accurate horizontal load-displacement curves of rocking steel bridge piers. The hysteresis curves of the hollow and concrete-infilled rocking steel bridge piers are close, so directly using hollow steel bridge piers can improve the economic efficiency of the design. Compared to adding longitudinal stiffeners, thickening the wall plates at the pier bottom is more effective in improving the seismic performance of the bridge piers. Reinforcement of the pier bottom has little effect on the energy dissipation capacity of the bridge pier, but it helps to reduce residual displacement and improve lateral stiffness.

    Fractional viscoelastic solution of stratum displacement of a shallow tunnel under the surface slope condition

    No full text
    The unified displacement function (UDF) is presented to describe the time-dependent deformation behaviour of the tunnel profile under the surface slope condition. Based on the discrete Fourier method, the third-order UDF in the physical plane is expanded into a Laurent series in the complex variable plane. The complex variable method is employed to derive the elastic analytical solution for the stratum displacement when the third-order UDF is taken as the displacement boundary condition of the tunnel cross-section (DBCTC). The proposed elastic solution agrees well with the results of the finite element method for the consistent model, which verifies the correctness of the proposed analytical solution. Combining the correspondence principle and the fractional generalized Kelvin viscoelastic constitutive model, the fractional viscoelastic solution under the surface slope condition is determined. The time effect of the stratum displacement is presented in two aspects: the time-dependent DBCTC and the time-dependent material parameters. A parameter analysis is performed to investigate the influences of the deformation modes of the third-order UDF, the slope angle, the tunnel radius and the fractional order on the time effect of the vertical and horizontal stratum displacement.
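
    As a schematic reminder of the constitutive building block mentioned above (our paraphrase, not an equation quoted from the paper), a fractional Kelvin element replaces the Newtonian dashpot of the classical Kelvin-Voigt model with a fractional derivative of order α:

```latex
% Fractional Kelvin--Voigt element (illustrative; symbols and order are assumed):
\[
  \sigma(t) \;=\; E\,\varepsilon(t) \;+\; \eta\,D_t^{\alpha}\varepsilon(t),
  \qquad 0 < \alpha \le 1 ,
\]
% which reduces to the classical Kelvin--Voigt model for \alpha = 1; the
% correspondence principle then replaces the elastic constants of the elastic
% solution by their transformed viscoelastic counterparts to obtain the
% time-dependent displacement field.
```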

    Effects of injection pressure on cavitation and spray in marine diesel engine

    No full text
    Numerical simulation of cavitation and spray in a marine diesel engine is performed to investigate the effects of injection pressure on the cavitation flow and spray characteristics, which in turn influence atomization and combustion in the cylinder. A two-phase flow model combined with single-bubble dynamics and a droplet break-up model are used to simulate cavitation and spray, respectively, and the results are compared with experimental data. With increasing injection pressure, the pressure fluctuations inside the nozzle become more intense. The spray penetration is proportional to time at the beginning of injection, and higher injection pressure increases the spray angle. In addition, massive structures on the spray edge can return to the spray body, whereas those on the spray head remain unchanged throughout their lifetime. Each additional 20 MPa of injection pressure reduces the Sauter mean diameter by approximately 9%.
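
    Reading the last sentence as a multiplicative trend (our interpretation, not a formula given in the paper), the reduction compounds as follows:

```latex
% Compounded effect of injection pressure on Sauter mean diameter (SMD),
% assuming the reported ~9% reduction per 20 MPa applies multiplicatively:
\[
  \mathrm{SMD}(p_{\mathrm{inj}} + \Delta p)
  \;\approx\;
  \mathrm{SMD}(p_{\mathrm{inj}}) \times 0.91^{\,\Delta p / 20\,\mathrm{MPa}} ,
\]
% e.g. \Delta p = 60 MPa gives 0.91^3 \approx 0.75, i.e. roughly a 25% smaller SMD.
```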